
BACKGROUND TO THE INVENTION 
Field of the Invention 

This invention relates to coding of audio signals into a data stream such that it
can be edited at points synchronised to another data stream. It has particular, but not
exclusive, application to a digital television transmission scheme requiring non-destructive
splicing of the audio in the compressed domain at the associated video frame boundaries.

Digital Television (DTV) systems allow several programmes to be broadcast
over a channel of limited bandwidth. Each of these programmes has video and audio content.
Some of these programmes may contain high quality multichannel audio (e.g., 5 channels
that can be reproduced by home cinema systems). DTV production sites, networks and
affiliates typically use video tape recorders and transmission lines for carrying all audio
content. Much of this infrastructure has capacity for only two uncompressed audio channels,
so multiple channels are normally lightly compressed and formatted before recording or
transmission. Prior to emission (i.e., broadcasting to the end-user) the programme streams are
strongly compressed.

In contribution and distribution stages of DTV production, original streams
must be spliced for programme editing or programme switching (e.g., for insertion of local
content into a live network feed). Such splicing is performed at video frame boundaries
within the content stream.

The audio content of the broadcast stream must meet several requirements.
DTV viewers may expect received programmes to have a high perceptive audio quality,
particularly when the programmes are to be reproduced using high quality reproduction
equipment such as in a home cinema system. For example, there should be no audible
artefacts due to cascading of multiple encoding and decoding stages, and there should be no
perceptible interruption in sound during programme switching. Most importantly, the
reproduced programmes must be in lip sync; that is to say, the audio stream must be
synchronous with the corresponding video stream. To achieve these ends at a reasonable
cost, i.e., using the existing (2-channel) infrastructure, one must splice the audio programme
in the compressed domain.
Summary of the Prior Art

Existing mezzanine encoding schemes include Dolby E (r.t.m.), defined in
Dolby Digital Broadcast Implementation Guidelines, Part No. 91549, Version 2, 1998, of
Dolby Laboratories, for distribution of up to 8 channels of encoded audio and multiplexed
metadata through an AES-3 pair. The soon to be introduced (NAB 1999) DP571 Dolby E
Encoder and DP572 Dolby E Decoder should allow editing and switching of encoded audio
with a minimum of mutes or glitches. Moreover, they allow cascading without audible
degradation. Dolby E uses a 20-bit sample size and provides a reduction between 2:1 and 5:1
in bitrate.


The British Broadcasting Corporation and others are proposing, through the
ACTS ATLANTIC project, a flexible method for switching and editing of MPEG-2 video
bitstreams. This seamless concatenation approach uses decoding and re-encoding with side
information to avoid cascading degradation. However, this scheme is limited to application
with MPEG-2 Layer II and the AES/EBU interface. Moreover, the audio data is allowed to
slide with respect to edit points, introducing a time offset. Successive edits can result,
therefore, in a large time offset between the audio and video information.

Throughout the broadcasting chain, video and audio streams must be
maintained in lip sync. That is to say, the audio must be kept synchronous to the
corresponding video. Prior to emission, distribution sites may splice (e.g., switch, edit or
mix) audio and video streams (e.g., for inclusion of local content). After splicing, if video
and audio frame boundaries do not coincide, which is the case for most audio coding
schemes, it is not possible to automatically guarantee lip sync due to slip of the audio with
respect to the video. In extreme cases, when no special measures are taken, this could lead to
audio artefacts, such as mutes or glitches. Glitches may be the result of an attempt to decode
a non-compliant audio stream, while mutes may be applied to avoid these glitches. An aim of
this invention is to provide an encoding scheme for an audio stream that can be spliced
without introducing audio artefacts such as mutes, glitches or slips.

Another aim of this invention is to provide an encoding scheme that can be
subject to cascading compression and decompression with a minimal loss of quality.

SUMMARY OF THE INVENTION 

From a first aspect, the invention provides an audio encoding scheme for a
stream that encodes audio and video data, which scheme has a mean effective audio frame
length $\bar{F}$ that equals the video frame length over an integral number $M$ of video frames,
by provision of audio frames variable in length $F$ in a defined sequence $F(j)$ at encoding.

This scheme ensures that the stream can be edited at least at each video frame
without degradation to the audio information. Preferably, the frame length $F$ may be adjusted
by varying an overlap $O$ between successive audio frames.

In schemes embodying the invention, the value $F(j)$ may repeat periodically
on $j$, the periodicity of $F(j)$ defining a sequence of frames. There are typically $M$ video and
$N$ audio frames per sequence, each audio frame being composed of $k$ blocks. The total
overlap $O_T$ between frames in the sequence may be, for example, equal to
$O_T = p \times O + q \times (O + 1)$, where $O$ is an overlap length in blocks.

In one scheme within the scope of the invention, only audio frames
corresponding to a particular video frame are overlapped. In such a scheme, the values of $p$
and $q$ may meet the following equalities: $p = (N - M) \times (O + 1) - O_T$ and $q = (N - M) - p$.

In an alternative scheme, only audio frames corresponding to a particular
video sequence are overlapped. In such a scheme, the values of $p$ and $q$ may meet the
following equalities: $p = (N - 1) \times (O + 1) - O_T$ and $q = (N - 1) - p$.

In a further alternative scheme, any adjacent audio frames are overlapped. In
such a preferred scheme, the values of $p$ and $q$ may meet the following equalities:
$p = N \times (O + 1) - O_T$ and $q = N - p$. This latter scheme may provide optimal values of
overlap for a sequence of $M$ video frames such that
$\exists\, n \in \mathbb{N} : n \times t = M \times \frac{f_A}{f_V}$.

We define a video sequence as an integer (and possibly finite) number of video
frames (i.e., $M$) at a rate of $f_V$ video frames per second, each video frame containing an
equal integer number $N$ of (compressed) audio frames, each audio frame containing an
integer number $k$ of blocks, each block representing an integer number $t$ of audio samples at
a sampling rate of $f_A$ samples per second. By making the remainder of the division between
the number of video frames times the quotient between audio and video frequencies, and the
number of audio samples per block of (compressed) audio equal to zero, $M$ is guaranteed to
be an integer. Thus, $N$ is also an integer. Consequently, the total number of overlapping
blocks is also an integer and so is each single overlap. That the number of overlapping
blocks is an integer is, in most cases, a requirement. Blocks of samples are the smallest units
of information handled by the underlying codec.

From a second aspect, the invention provides an audio encoding scheme for a
stream that carries encoded audio and video data, in which scheme audio samples of $N$ quasi
video-matched frames are encoded in frames with a semi-variable overlap whereby the
effective length of the audio frames coincides with the length of a sequence of $M$ video
frames, where $M$ and $N$ are positive integers.

The invention provides a data stream encoded by a scheme according to either
preceding aspect of the invention. Such a stream may include audio frames, each of which is
tagged to indicate the size of the audio frame. The blocks may be similarly tagged to indicate
whether the block is a redundant block.

From another aspect, this invention provides an audio encoder (that may be
implemented, for example, as a software component or a hardware circuit) for encoding an
audio stream according to the first aspect of the invention; and it further provides an audio
decoder for decoding an audio stream according to the first aspect of the invention.

An audio decoder according to this aspect of the invention operates by
modifying the redundancy status of blocks in the data stream by application of one or more of
a set of block operators to each block. This may be accomplished by a set of operators that
includes one or more of: NOP, an operator that does not change the status of a block; DROP,
an operator that changes the first non-redundant block from the head overlap into a redundant
block; APPEND, an operator that changes the first redundant block from the tail overlap into
a non-redundant block; and SHIFT, an operator that is a combination of both DROP and
APPEND operators.

In particular, the invention provides an audio encoder for coding audio for a
stream that encodes audio and video data, in which the encoder produces audio frames of
variable length such that a mean effective audio frame length $\bar{F}$ equals the video frame
length over an integral number $M$ of video frames, by provision of audio frames of variable
overlap to have length $F$ in a defined sequence $F(j)$ at encoding.

Such an audio encoder may code a stream to have a short overlap of length $O$
and a total of $q$ long overlaps in a sequence, the encoder calculating the head overlap using an
algorithm that repeats after $N$ audio frames.

From a further aspect, the invention provides an audio decoder (that may be
implemented, for example, as a software component or a hardware circuit) for decoding a
stream that carries encoded audio and video data, which decoder calculates an expected
frame length of an incoming frame $F$ in a, possibly circularly shifted, sequence $F(j)$, adjusts
the actual length of the incoming frame to make it equal to the expected frame length,
determines whether any block within a received frame is a redundant block or a
non-redundant block, and maps the non-redundant blocks onto sub-band audio samples.

In systems embodying the invention, there is typically no extra manipulation
of the audio, such as sample rate conversion. Moreover, all information needed to correctly
decode the received stream is most typically added at the encoder and there is no need to
modify this information during editing. Therefore, editing may be done using the existing
infrastructure with no modifications. Furthermore, very little extra information need be
added to the stream in order to make decoding possible. Last, but not least, when using
MPEG as the emission format, it may be convenient to also use an MPEG-like format for
transmission.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION
An embodiment of the invention will now be described in detail, by way of
example only, and with reference to the accompanying drawings, in which:

Figure 1 is a diagram of a typical chain involved in DTV broadcasting;
Figure 2 is a diagram showing the principal components of a typical DTV
production site;
Figure 3 is a diagram showing the principal components of a typical DTV
network site;
Figure 4 is a diagram that shows the arrangement of audio and video frames
within a stream encoded in accordance with a first approach in an embodiment of the
invention;
Figure 5 is a diagram that shows the arrangement of audio and video frames
within a stream encoded in accordance with a second approach in an embodiment of the
invention;
Figure 6 is a diagram that shows the arrangement of audio and video frames
within a stream encoded in accordance with a third approach in an embodiment of the
invention;
Figure 7 shows the bit allocation of a stream embodying the invention, based
on MPEG-2 Layer II, for NTSC and 48 kHz audio in IEC61937; and
Figure 8 is a diagram of the arrangement of blocks in a stream encoded by an
embodiment of the invention.

In the following description, the following symbols are used throughout: 


$f_A$, $f_V$ audio sampling frequency, video frame rate

$t_A$, $t_V$ audio, video frame duration

$s$ samples per audio frame

$k$ blocks of samples per audio frame

$t$ samples per block

$O$, $O_T$, $\bar{O}$ short, total and average overlap

$M$, $N$ quantity of video, audio frames per sequence

$p$ quantity of short overlaps per sequence

$q$ quantity of long overlaps per sequence

$j$ frame index

$F(j)$, $G(j)$ frame's effective length

$H(j)$, $T(j)$ frame's head, tail overlap

$X(j)$, $\bar{X}(j)$ accumulated effective length, accumulated mean effective length

$\bar{F}$ mean effective length

$b$ short frame's length

$B$ total number of blocks in video sequence

$\varphi$ phase

$\mathbb{N}$ {1, 2, 3, ..., ∞}

$Q$ null padding

$A(j)$ append operation toggle

$OP(j)$ operator

^U) synchronisation error 

S total synchronisation error 

u , V auxiliary variables 


With reference first to Figure 1, a typical DTV broadcasting system is a chain 
involving a contribution stage 10, a distribution stage 12 and an emission stage 14. 

In the contribution stage, content is originated at one or more production sites
20, and transferred by a distribution network 22 to a broadcast network site 24. The
broadcast network 24 produces a programme stream that includes the content, and distributes
the programme stream over a distribution network 30 to affiliates, such as a direct-to-home
satellite broadcaster 32, a terrestrial broadcaster 34, or a cable television provider 36. A
subscriber 40 can then receive the programme stream from the output of one of the affiliates.

Within the production site, content of several types may be produced and
stored on different media. For example, a first studio 50 may produce live content and a
second studio 52 may produce recorded content (e.g. commercial advertisements). In each
case, the content includes a video and an audio component. Output from each studio 50, 52 is
similarly processed by a respective encoder 54 to generate an elementary stream that
encodes the audio and video content. The content from the first studio 50, to be broadcast
live, is then transmitted to the distribution network 22 by a radio link (after suitable
processing). Time is not critical for the content of the second studio, so this may be recorded
on tape 56 and sent to the distribution network 22 in an appropriate manner. The encoder 54,
and the elementary stream that it produces, are embodiments of aspects of the invention.

Within the network site 24, as shown in Figure 3, content from various sources
is spliced to construct a programme output by a splicer 60. Input to the splicer 60 is derived
from elementary streams of similar types that can be derived from various sources such as via
a radio link from the production unit 20, a tape 56 or a local studio 64. Output of the splicer
60 is likewise an elementary stream that, at any given time, is a selected one of the input
streams. The splicer 60 can be operated to switch between the input streams in a manner that
ensures that the audio and video components of the output stream can be seamlessly
reproduced. Output of the splicer 60 is then processed by a packetiser 62 to form a transport
stream. The transport stream is then modulated for transmission by a radio link to the
affiliates for distribution to subscribers.

The video content encoded within an elementary stream embodying the
invention will typically comprise a sequence of scanned video frames. Such frames may be
progressive scanning video frames, in which case each frame is a complete still picture. In
such cases, the video frames have a frame rate $f_V$ and are of duration $t_V = 1/f_V$.
Alternatively, the frames may be interlaced scanning frames in which each frame is built up
from two successive interlaced fields, the field frequency being $2f_V$ in the notation
introduced above. The frame rate and scanning type are defined by the television system for
which the stream is intended. Basic TV standards PAL and NTSC derived the frame rates
from the mains frequency of the countries where the standards were used. With the
introduction of colour, NTSC was modified by a factor 1000/1001. Additionally, film uses
24 Hz, which may be modified by the same factor. Moreover, computer monitors can run at
several frame rates up to 96 Hz. Typical values of $f_V$ are given in Table 1, below.


Video frame rate [Hz]   t_V [ms]   Application
23.976                  41.71      3-2 pull-down NTSC
24                      41.67      film
25                      40         PAL, SECAM
29.97                   33.37      NTSC, PAL-M, SECAM-M
30                      33.33      drop-frame NTSC
50                      20         double-rate PAL
59.94                   16.68      double-rate NTSC
60                      16.67      double-rate, drop-frame NTSC

Table 1

The audio signal is a time-continuous pulse-code modulated (PCM) signal
sampled at a frequency $f_A$, for example, 48 kHz. Example values of $f_A$ are given in Table 2,
below.


Audio sampling frequency [kHz]   Application
24                               DAB
32                               DAT, DBS
44.1                             CD, DA-88, DAT
48                               professional audio, DA-88, DVD
96                               DVD

Table 2

Besides these frequencies, it is also possible to find 44.1 and 48 kHz modified
by a factor 1000/1001 (e.g., 44.056, 44.144, 47.952 and 48.048 kHz) for conforming audio in
pull-up and pull-down film-to-NTSC conversions. Additionally, for film-to-PAL
conversions, a 24/25 factor may be applied (e.g., 42.336, 45.937, 46.08 and 50 kHz).
Moreover, DAB may use 24 and 48 kHz; DVD-Audio may use 44.1, 88.2, 176.4, 48, 96 and
192 kHz; DVD-Video may use 48 and 96 kHz. DAT is specified for 32, 44.1 and 48 kHz;
special versions may also use 96 kHz. Finally, compressed audio at very low bit rates may
require lower sampling frequencies (e.g., 16, 22.05 and 24 kHz).
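
The modified sampling frequencies quoted above follow from multiplying a nominal rate by a conversion factor or by its inverse. The following Python sketch reproduces that arithmetic; the name `modified_rates_khz` and the `FACTORS` table are purely illustrative and are not part of the described scheme.

```python
from fractions import Fraction

# Conversion factors named in the text, each together with its inverse.
FACTORS = {
    "1000/1001": Fraction(1000, 1001),
    "1001/1000": Fraction(1001, 1000),
    "24/25": Fraction(24, 25),
    "25/24": Fraction(25, 24),
}


def modified_rates_khz(nominal_khz):
    """Return the nominal rate (in kHz) scaled by each factor, as floats."""
    nominal = Fraction(str(nominal_khz))  # exact decimal, e.g. "44.1" -> 441/10
    return {name: float(nominal * factor) for name, factor in FACTORS.items()}


for rate in (44.1, 48):
    print(rate, {name: round(value, 3) for name, value in modified_rates_khz(rate).items()})
```

Exact rational arithmetic is used so that the rounding to three decimals happens only once, when the values are displayed.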

The sample width is typically 16, 20 or 24 bits. 
Before compression, the audio stream is divided in audio frames of duration
$t_A = s/f_A$, where $s$ is the number of samples per audio frame (e.g., in MPEG-2 Layer II
$s = 1152$ samples; in AC-3 $s = 1536$ samples). Examples of frame lengths used in various
coding schemes are shown in Table 3, below.


Coding scheme      Use              Frame length [samples]   Frame length [ms] @ 48 kHz
MPEG-1 Layer I     DCC              384                      8
MPEG-1 Layer II    DAB, DVB, DVD-V  1,152                    24
MPEG-1 Layer III   ISDN, MP3        1,152                    24
MPEG-2 Layer II    DVB, DVD         1,152                    24
MPEG-2 AAC                          1,024                    21.33
Dolby AC-3         DVD              1,536                    32
Sony ATRAC         MiniDisc         512                      n.a.

Table 3

Inside the audio encoder, audio frames are further divided into $k$ blocks of $t$
samples (e.g., in MPEG-2 Layer II there are 36 blocks of 32 samples). The blocks are the
smallest unit of audio to be processed. This may be expressed as $s = k \times t$. Table 4 below
presents examples of frame sub-divisions used in various coding schemes.


Coding scheme     k × t [blocks × samples]
MPEG Layer I      12 × 32
MPEG Layer II     36 × 32
MPEG Layer III    2 × 576
Dolby AC-3        6 × 256

Table 4
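
The relations $s = k \times t$ and $t_A = s/f_A$ tie Table 4 to the frame lengths of Table 3. A short Python sketch, with illustrative names only (`CODECS`, `samples_per_frame` and `frame_duration_ms` are assumptions made for this example), shows the computation:

```python
# (k, t) pairs taken from Table 4: k blocks per frame, t samples per block.
CODECS = {
    "MPEG Layer I": (12, 32),
    "MPEG Layer II": (36, 32),
    "MPEG Layer III": (2, 576),
    "Dolby AC-3": (6, 256),
}


def samples_per_frame(k, t):
    """s = k * t, the number of samples in one audio frame."""
    return k * t


def frame_duration_ms(k, t, f_a_hz=48_000):
    """t_A = s / f_A, expressed in milliseconds."""
    return 1000.0 * samples_per_frame(k, t) / f_a_hz


for name, (k, t) in CODECS.items():
    print(f"{name:15s} s = {samples_per_frame(k, t):5d}  t_A = {frame_duration_ms(k, t):5.2f} ms")
```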

Throughout the broadcasting chain, video and audio streams must be
maintained in lip sync. That is to say, the audio must be kept synchronous to the
corresponding video. Prior to emission, distribution sites may splice (e.g., switch, edit or
mix) audio and video streams (e.g., for inclusion of local content).
After splicing, if video and audio frame boundaries do not coincide, which is
the case for most audio coding schemes, it is not possible to automatically guarantee lip sync.
In extreme cases, when no special measures are taken, this could lead to audio artefacts, such
as mutes or slips.
Although the various embodiments of the invention can perform an encoding
related to existing standards (such as MPEG-1 and MPEG-2), the embodiments are not
necessarily backward compatible with these existing standards.

BASIS OF THE EMBODIMENTS 

In the coding scheme of the present embodiment, the audio samples are encoded in $N$
quasi video-matched frames, with a semi-variable overlapping, to coincide with a sequence of
$M$ video frames. Upon encoding in accordance with an embodiment of the invention, each
video frame contains an equal integer number of audio frames. Therefore, editing may be
done at video frame boundaries. Upon decoding, redundant samples may be discarded.

Assuming an audio frame is divided in $k$ blocks of $t$ samples, the total overlap
$O_T$, in blocks, may be calculated by:

$$O_T = k \times N - \frac{M}{t} \times \frac{f_A}{f_V}$$    Equation 1

where $M$, $k$ and $t$ are positive integers and $f_A$ and $f_V$, which represent frequencies in Hz,
are such that $f_A/f_V$ is a rational number.


For providing cross-fade between edited audio streams within the decoder
reconstruction filters, the total overlap $O_T$ is chosen to coincide with an integer number of
blocks, as given by:

$$O_T = p \times O + q \times (O + 1)$$    Equation 2

where $p$, $q$ and $O$ are non-negative integers.
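
As a worked check of the total overlap, the Python sketch below evaluates it for a given sequence. It assumes Equation 1 reads $O_T = k \times N - (M/t) \times (f_A/f_V)$, that is, the blocks carried by $N$ audio frames minus the blocks spanned by $M$ video frames; this reading reproduces the $O_T$ column of the tables that follow. The function name `total_overlap` is illustrative.

```python
from fractions import Fraction


def total_overlap(k, t, m, n, f_a, f_v):
    """Total overlap O_T in blocks for N audio frames spanning M video frames.

    f_a and f_v may be ints or Fractions (e.g. Fraction(30000, 1001) for NTSC).
    """
    blocks_in_sequence = Fraction(m) * Fraction(f_a) / (Fraction(f_v) * t)
    if blocks_in_sequence.denominator != 1:
        raise ValueError("M video frames do not span a whole number of blocks")
    return k * n - int(blocks_in_sequence)


# MPEG-2 Layer II (k = 36, t = 32), 48 kHz audio, 24 Hz video, M = 2, N = 4
print(total_overlap(36, 32, 2, 4, 48_000, 24))
```

Passing the NTSC rate as `Fraction(30000, 1001)` rather than `29.97` keeps the arithmetic exact, which matters because the result must be an integer number of blocks.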

Within various embodiments of the invention, various approaches can be
adopted for spreading the total overlap through the audio frames. That is, by imposing
different restrictions one may give different implementations for these embodiments. Three
such approaches are referred to herein as:

Approach 1 - overlaps within video frame;

Approach 2 - overlaps within sequence of video frames; and

Approach 3 - overlap throughout the video stream.

It can be shown that Approach 3 always offers the smallest possible overlap
between two adjacent audio frames, often with the smallest number of video frames per
sequence. Therefore, for many applications, this approach will be preferred to the others.
However, depending upon the particular application, this may not always be the case.
Approach 1

When the overlaps exist only within one video frame, as in Figure 4, the average overlap $\bar{O}$,
in blocks, is given by:

$$\bar{O} = \frac{O_T}{N - M}$$    Equation 3

which may be implemented as

$$p = (N - M) \times (O + 1) - O_T$$    Equation 4

overlaps of length $O$ blocks and

$$q = (N - M) - p$$    Equation 5

overlaps of length $(O + 1)$ blocks.


Approach 2

When the overlaps exist only within one sequence, as in Figure 5, the average overlap $\bar{O}$, in
blocks, is given by:

$$\bar{O} = \frac{O_T}{N - 1}$$    Equation 6

which may be implemented as

$$p = (N - 1) \times (O + 1) - O_T$$    Equation 7

overlaps of length $O$ blocks and

$$q = (N - 1) - p$$    Equation 8

overlaps of length $(O + 1)$ blocks.


Approach 3

When the overlaps exist within sequences, as in Figure 6, the average overlap
$\bar{O}$, in blocks, is given by:

$$\bar{O} = \frac{O_T}{N}$$    Equation 9

which may be implemented as

$$p = N \times (O + 1) - O_T$$    Equation 10

overlaps of length $O$ blocks and

$$q = N - p$$    Equation 11

overlaps of length $(O + 1)$ blocks.
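
The three approaches differ only in the number of frame junctions over which the total overlap is spread: $N - M$, $N - 1$ or $N$. The Python sketch below splits a total overlap into $p$ short and $q$ long overlaps on that basis, taking the short overlap $O$ as the integer part of the average overlap. The function name `split_overlap` and its argument names are illustrative assumptions, not terms from the embodiment.

```python
def split_overlap(o_total, n, m, approach):
    """Return (O, p, q) such that o_total == p * O + q * (O + 1).

    approach 1: overlaps only within a video frame      -> N - M junctions
    approach 2: overlaps only within a video sequence   -> N - 1 junctions
    approach 3: overlaps between any adjacent frames    -> N junctions
    """
    junctions = {1: n - m, 2: n - 1, 3: n}[approach]
    if junctions <= 0:
        raise ValueError("no junctions available for this approach")
    o_short = o_total // junctions           # O = floor(average overlap)
    p = junctions * (o_short + 1) - o_total  # number of short overlaps
    q = junctions - p                        # number of long overlaps
    assert p * o_short + q * (o_short + 1) == o_total
    return o_short, p, q


# MPEG-2 Layer II, 48 kHz, 24 Hz: M = 2, N = 4, total overlap 19 blocks
for approach in (1, 2, 3):
    print(approach, split_overlap(19, 4, 2, approach))
```

The guard on `junctions` covers cases such as $N = M$ under Approach 1, where no junction lies inside a video frame and the approach cannot be applied.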
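The overlap length $O$ may be expressed as

$$O = \lfloor \bar{O} \rfloor$$    Equation 12

which, for the last approach, can be written as:

$$O = \left\lfloor \frac{O_T}{N} \right\rfloor = \left\lfloor k - \frac{M}{N \times t} \times \frac{f_A}{f_V} \right\rfloor$$    Equation 13

$M$ is chosen to satisfy:

$$\left( M \times \frac{f_A}{f_V} \right) \bmod t = 0$$    Equation 14

and the rate of audio frames per video frame $N/M$ may be written as:

$$\frac{N}{M} = \left\lceil \frac{t_V}{t_A} \right\rceil$$    Equation 15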
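
To illustrate how $M$ and $N$ may be chosen, the Python sketch below picks the smallest $M$ for which $M$ video frames span a whole number of blocks, and then sets $N/M$ to the number of audio frames needed to cover one video frame. The ceiling rule for $N/M$ is an assumption made for this example; it matches the $M$ and $N$ columns of the Approach 3 tables. The name `sequence_size` is illustrative.

```python
from fractions import Fraction
from math import ceil


def sequence_size(k, t, f_a, f_v):
    """Return (M, N): video and audio frames per sequence.

    M is the smallest positive integer with (M * f_A / f_V) divisible by t.
    N / M is taken as ceil(t_V / t_A), an assumption of this sketch.
    """
    blocks_per_video_frame = Fraction(f_a) / (Fraction(f_v) * t)
    m = blocks_per_video_frame.denominator       # clears the fractional part
    n = m * ceil(blocks_per_video_frame / k)     # audio frames per video frame, times M
    return m, n


print(sequence_size(36, 32, 48_000, 24))                     # Layer II, 24 Hz
print(sequence_size(12, 32, 48_000, Fraction(30000, 1001)))  # Layer I, NTSC
```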


Cross-fade

The reconstruction filter in an MPEG-1 decoder as defined in ISO/IEC 11172
"Coding of moving pictures and associated audio for digital storage media at up to about 1.5
Mbit/s" Part 3: Audio (1993-08) is an overlapping filter bank. If splicing is done in the
sub-band domain (i.e., on blocks), this results in a cross-fade of about 512 audio samples upon
decoding.


IMPLEMENTATION OF EMBODIMENTS BASED ON COMMON CODING 
STANDARDS 

Various encoding schemes have been considered as a basis for embodiments
of the invention. In particular, MPEG-1 and MPEG-2, Layers I and II, have been considered,
but this is by no means an exclusive list of possible schemes. It must be said here that
schemes embodying the invention use coding schemes similar to existing standards but, due
to overlapping, they deviate from these standards.

As will be familiar to those skilled in the technical field, MPEG-2 is a
standard scheme for encoding multichannel audio backward compatible with MPEG-1. On
the other hand, a non-backward-compatible extension of the MPEG-1 standard to
multichannel may offer implementation simplicity. Moreover, Layer II is more efficient than
Layer I. On the other hand, Layer I offers less encoding redundancy due to its having a
smaller number of blocks. A scheme based on MPEG-1 Layer I may offer the best
combination of low redundancy and implementation simplicity in embodiments of the
invention.

MPEG-2 Layer II 

When using MPEG-2 Layer II as a basis for the encoding scheme, $k = 36$ and
$t = 32$.

Table 5 shows some examples of overlap sequences for various combinations
of audio sample frequencies and video frame rates when the embodiment is based upon
Approach 1, as described above.


f_V [Hz]  f_A [kHz]  M      N      O_T     Ō          p × O + q × (O + 1)
23.976    48         16     32     151     9.437...   9 × 9 + 7 × 10
23.976    44.1       2,560  5,120  37,173  14.520...  1,227 × 14 + 1,333 × 15
23.976    32         24     48     727     30.291...  17 × 30 + 7 × 31
24        48         2      4      19      9.5        1 × 9 + 1 × 10
24        44.1       64     128    933     14.578...  27 × 14 + 37 × 15
24        32         3      6      91      30.333...  2 × 30 + 1 × 31
25        48         1      2      12      12         1 × 12 + 0 × 13
25        44.1       8      16     135     16.875     1 × 16 + 7 × 17
25        32         1      2      32      32         1 × 32 + 0 × 33
29.97     48         20     40     439     21.95      1 × 21 + 19 × 22
29.97     44.1       3,200  6,400  83,253  26.016...  3,147 × 26 + 53 × 27
29.97     32         n/a    n/a    n/a     n/a        n/a

Table 5 MPEG-2 Layer II and Approach 1


Table 6 shows some examples of overlap sequences for diverse combinations of audio
sample frequencies and video frame rates when the embodiment is based upon Approach 2,
as described above.
f_V [Hz]  f_A [kHz]  M      N       O_T      Ō          p × O + q × (O + 1)
23.976    48         16     32      151      4.870...   4 × 4 + 27 × 5
23.976    48         32     64      302      4.793...   13 × 4 + 50 × 5
23.976    48         48     96      453      4.768...   22 × 4 + 73 × 5
23.976    44.1       2,560  5,120   37,173   7.261...   3,779 × 7 + 1,340 × 8
23.976    32         24     48      727      15.468...  25 × 15 + 22 × 16
23.976    32         48     96      1,454    15.305...  66 × 15 + 29 × 16
23.976    32         72     144     2,181    15.251...  107 × 15 + 36 × 16
24        48         2      4       19       6.333...   2 × 6 + 1 × 7
24        48         10     20      95       5          19 × 5 + 0 × 6
24        48         48     96      456      4.8        19 × 4 + 76 × 5
24        44.1       64     128     933      7.346...   83 × 7 + 44 × 8
24        44.1       128    256     1,866    7.317...   174 × 7 + 81 × 8
24        44.1       192    384     2,799    7.308...   265 × 7 + 118 × 8
24        32         3      6       91       18.2       4 × 18 + 1 × 19
24        32         6      12      182      16.545...  5 × 16 + 6 × 17
24        32         24     48      728      15.489...  24 × 15 + 23 × 16
25        48         1      2       12       12         1 × 12 + 0 × 13
25        48         2      4       24       8          3 × 8 + 0 × 9
25        48         7      14      84       6.461...   7 × 6 + 6 × 7
25        44.1       8      16      135      9          15 × 9 + 0 × 10
25        44.1       72     144     1,215    8.496...   72 × 8 + 71 × 9
25        32         1      2       32       32         1 × 32 + 0 × 33
25        32         2      4       64       21.333...  2 × 21 + 1 × 22
25        32         17     34      544      16.484...  17 × 16 + 16 × 17
29.97     48         20     40      439      11.256...  29 × 11 + 10 × 12
29.97     48         40     80      878      11.113...  70 × 11 + 9 × 12
29.97     48         220    440     4,829    11         439 × 11 + 0 × 12
29.97     44.1       3,200  6,400   83,253   13.010...  6,333 × 13 + 66 × 14
29.97     44.1       6,400  12,800  166,506  13.009...  12,680 × 13 + 119 × 14
29.97     32         30     30      79       2.724...   8 × 2 + 21 × 3
29.97     32         60     60      158      2.677...   19 × 2 + 40 × 3
29.97     32         90     90      237      2.662...   30 × 2 + 59 × 3

Table 6 MPEG-2 Layer II and Approach 2


Table 7 shows some overlap sequences for various combinations of audio sample frequencies 
and video frame rates when the embodiment is based upon Approach 3, as described above. 


f_V [Hz]  f_A [kHz]  M      N      O_T     Ō          p × O + q × (O + 1)
23.976    48         16     32     151     4.718...   9 × 4 + 23 × 5
23.976    44.1       2,560  5,120  37,173  7.260...   3,787 × 7 + 1,333 × 8
23.976    32         24     48     727     15.145...  41 × 15 + 7 × 16
24        48         2      4      19      4.75       1 × 4 + 3 × 5
24        44.1       64     128    933     7.289...   91 × 7 + 37 × 8
24        32         3      6      91      15.166...  5 × 15 + 1 × 16
25        48         1      2      12      6          2 × 6 + 0 × 7
25        44.1       8      16     135     8.437...   9 × 8 + 7 × 9
25        32         1      2      32      16         2 × 16 + 0 × 17
29.97     48         20     40     439     10.975     1 × 10 + 39 × 11
29.97     44.1       3,200  6,400  83,253  13.008...  6,347 × 13 + 53 × 14
29.97     32         30     30     79      2.633...   11 × 2 + 19 × 3

Table 7 MPEG-2 Layer II and Approach 3



MPEG-2 Layer I

When using MPEG-2 Layer I as the encoding scheme, $k = 12$ and $t = 32$. By using Approach
3, we obtain the sequences shown in Table 8.
f_V [Hz]  f_A [kHz]  M      N       O_T    Ō         p × O + q × (O + 1)
23.976    48         16     96      151    1.572...  41 × 1 + 55 × 2
23.976    44.1       2,560  12,800  6,453  0.504...  6,347 × 0 + 6,453 × 1
23.976    32         24     96      151    1.572...  41 × 1 + 55 × 2
24        48         2      12      19     1.583...  5 × 1 + 7 × 2
24        44.1       64     384     933    2.429...  219 × 2 + 165 × 3
24        32         3      12      19     1.583...  5 × 1 + 7 × 2
25        48         1      5       0      0         5 × 0 + 0 × 1
25        44.1       8      40      39     0.975     1 × 0 + 39 × 1
25        32         1      4       8      2         4 × 2 + 0 × 3
29.97     48         20     100     199    1.99      1 × 1 + 99 × 2
29.97     44.1       3,200  12,800  6,453  0.504...  6,347 × 0 + 6,453 × 1
29.97     32         30     90      79     0.877...  11 × 0 + 79 × 1

Table 8 MPEG-2 Layer I and Approach 3

It should be noted that the average redundancy is much less than is the case
when using Layer II.


MPEG-1 

Another simplification that could be applied to embodiments is the use of
MPEG-1 as the basis for the encoding scheme. In this case, the upper limit of two channels
(e.g., stereo) of MPEG-1 can be extended to $n$ channels. Therefore, each channel can have a
bit allocation dependent on the total bit availability and on audio content per channel.

ALGORITHMS 

In the following section, algorithms applicable to calculating overlaps 
according to Approach 3 will be described. 


Encoding

An encoder for creating an embodiment stream creates a sequence of frames of
a predetermined structure. Each frame $j$ has the structure shown in Table 9 below, where $k$ is
the total number of blocks, $H(j)$ is the number of blocks in the head overlap and $T(j)$ is the
number of blocks in the tail overlap.



Table 9 

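Note that $T(j) = H(j + 1)$.
Knowing the values of $N$, $O$ and $q$, the encoder may calculate the exact head
overlap using the following algorithm.

/* counter is assumed to start at 0 and to persist between frames */
while (new frame) {
    /* long overlap on the first frame and whenever the counter passes N */
    if (counter >= N || counter == 0) {
        overlap = O + 1;
        counter = counter % N;
    }
    else overlap = O;
    counter = counter + q;   /* advance by the number of long overlaps */
    return (overlap);        /* head overlap for the current frame */
}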

In the case of MPEG-2 Layer II, $f_V = 24$ Hz and $f_A = 48$ kHz, we have from Table 7 that
$N = 4$, $O = 4$ and $q = 3$. That generates the following sequence of head overlaps: 5, 4, 5
and 5, or any circular shift thereof.
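
The counter-based algorithm can be exercised with a short Python generator. This is a sketch of the pseudocode above, not a definitive encoder implementation: the counter is assumed to start at zero, the name `head_overlaps` is illustrative, and the `q > 0` guard is an addition of this sketch so that a sequence with no long overlaps (for example 2 × 6 + 0 × 7) yields only short ones.

```python
from itertools import islice


def head_overlaps(n, o_short, q):
    """Yield the head overlap, in blocks, of successive audio frames.

    n       -- audio frames per sequence (N)
    o_short -- short overlap length (O)
    q       -- number of long overlaps (O + 1) per sequence
    """
    counter = 0  # assumed initial state
    while True:
        if q > 0 and (counter >= n or counter == 0):
            overlap = o_short + 1
            counter %= n
        else:
            overlap = o_short
        yield overlap
        counter += q


# MPEG-2 Layer II, 24 Hz video, 48 kHz audio: N = 4, O = 4, q = 3
print(list(islice(head_overlaps(4, 4, 3), 8)))
```

Over any $N$ consecutive frames the yielded overlaps sum to the total overlap of the sequence, which is a convenient sanity check when trying other rows of the tables.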

Every audio frame must be tagged to indicate its size. In the above-described
scheme, the head overlap may be only $O$ or $O + 1$ long. Therefore, it is possible to use a
1-bit tag to differentiate short and long frames.

The useful size $F(j)$ of the frame $j$ within a video sequence is given by:

$$F(j) = k - H(j + 1)$$    Equation 16

Every block must be tagged to indicate its redundancy. In the above-described
scheme, the block may be only redundant or not redundant. Therefore, it is possible to use a
1-bit tag to differentiate redundant and non-redundant blocks.
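
The useful sizes can be derived directly from the head overlaps. The Python sketch below assumes that the useful size of frame $j$ is $k$ minus the head overlap of the following frame, and that the index wraps around at the end of the sequence; the wrap-around and the name `useful_sizes` are assumptions of this example.

```python
def useful_sizes(k, head):
    """Return F(j) = k - H(j + 1) for each frame j of one sequence.

    head -- list of head overlaps H(0..N-1); the index j + 1 is taken
            modulo N because the overlap pattern repeats every sequence.
    """
    n = len(head)
    return [k - head[(j + 1) % n] for j in range(n)]


# MPEG-2 Layer II (k = 36) with head overlaps 5, 4, 5, 5
sizes = useful_sizes(36, [5, 4, 5, 5])
print(sizes, sum(sizes))
```

For this example the useful sizes sum to 125 blocks, i.e. 4,000 samples at 32 samples per block, which is the length of two video frames at 24 Hz and 48 kHz.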

Recording and Transmission 

Although redundant information must be encoded, it need not all be 
transmitted. This saves bitrate in the transmitted stream. The minimum total number of 
blocks $B_{MIN}$ to be recorded or transmitted within a video sequence is given by: 

Equation 17 


An extra redundant block per audio frame may be needed to allow for editing 
the encoded stream. In this case, the maximum total number of blocks $B_{MAX}$ to be recorded 
or transmitted within a video sequence is given by: 

Equation 18 


A phase $\varphi$ may be defined to indicate the relative start, in blocks, of the 
encoded stream with respect to the first video frame in the video sequence. A suitable choice 
for $\varphi$ is: 

$\varphi = \frac{O}{2}$    Equation 19 


Moreover, the encoder generates null padding Q to complete the stream in 
accordance with the IEC61937 standard. The length of the padding depends not only on the 
payload length but must also take video boundaries into consideration, to avoid a cumulative 
error being introduced into the encoded stream. 
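
The block counts discussed in this section can be checked numerically. The Python sketch below is built on stated assumptions, not on the expressions of Equations 17 and 18 themselves: it takes the minimum block count as the number of blocks spanned by M video frames, the total overlap as N x k minus that count, and the maximum as one extra block per audio frame. The split into p short and q long overlaps follows the Approach 3 relations p = N x (O + 1) - O_T and q = N - p. The function name and the dictionary keys are hypothetical.

```python
from fractions import Fraction


def overlap_parameters(f_v, f_a, k, t, m, n):
    """Derive overlap and block counts for one sequence of M video frames.

    Assumptions of this sketch: B_MIN = M * f_a / (f_v * t),
    O_T = N * k - B_MIN and B_MAX = B_MIN + N.
    """
    blocks = Fraction(m * f_a, f_v * t)      # blocks spanned by M video frames
    if blocks.denominator != 1:
        raise ValueError("M video frames must span a whole number of blocks")
    b_min = int(blocks)
    o_total = n * k - b_min                  # total overlap O_T, in blocks
    o_short = o_total // n                   # short overlap O
    p = n * (o_short + 1) - o_total          # number of short overlaps
    q = n - p                                # number of long overlaps
    return {"B_MIN": b_min, "B_MAX": b_min + n, "O_T": o_total,
            "O": o_short, "p": p, "q": q}


# 24 Hz video, 48 kHz audio, k = 36 blocks of t = 32 samples, M = 2, N = 4.
params = overlap_parameters(24, 48_000, 36, 32, 2, 4)
print(params)
```

For these inputs the sketch gives 125 and 129 blocks, a total overlap of 19 blocks, and one short and three long overlaps.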


Editing 

Editing of the stream encoded in accordance with the embodiment may be 
performed at video frame boundaries by adding, removing or appending frames. The 
decoder corrects the errors that may be generated by editing using information available 
within the decoder (such as values of $f_A$ and $f_V$) or information generated by the encoder 
(such as size tags). No additional information need be recorded or transmitted as a result of 
editing. Moreover, cross-fade at the editing point may be provided by a reconstruction filter 
bank within the decoder. 

Decoding 

A decoder for decoding a stream calculates the expected useful size F(j) for 
the current frame j. Moreover, it reads a size tag from the incoming frame to determine the 
actual useful size G(j). 

Blocks within an audio frame may have one of two statuses: redundant or 
non-redundant. Non-redundant blocks are recorded, transmitted and decoded into sub-band 
samples. Redundant blocks (such as the first redundant block in the tail overlap) may be 
recorded and transmitted in order to ease the decoding process. However, redundant blocks 
are never decoded into sub-band samples. 

For modifying the status of an overlap block, four operators are defined: NOP, 
DROP, APPEND and SHIFT. 

NOP: The NOP operator does not change the status of blocks. 

DROP: The DROP operator changes the first non-redundant block from the head overlap into 
a redundant block. 

APPEND: The APPEND operator changes the first redundant block from the tail overlap into 
a non-redundant block. 

SHIFT: The SHIFT operator is a combination of both DROP and APPEND operators. 
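
The effect of the four operators on block status can be sketched as follows. The data model is an assumption made for illustration only: a frame is a list of booleans (True meaning redundant), the head overlap is the first `head_len` blocks and the tail overlap is the last `tail_len` blocks. The function names and the toy frame are hypothetical.

```python
def nop(redundant):
    """NOP: leave every block status unchanged."""
    return list(redundant)


def drop(redundant, head_len):
    """DROP: mark the first non-redundant head-overlap block as redundant."""
    flags = list(redundant)
    for i in range(min(head_len, len(flags))):
        if not flags[i]:
            flags[i] = True
            break
    return flags


def append(redundant, tail_len):
    """APPEND: mark the first redundant tail-overlap block as non-redundant."""
    flags = list(redundant)
    for i in range(max(len(flags) - tail_len, 0), len(flags)):
        if flags[i]:
            flags[i] = False
            break
    return flags


def shift(redundant, head_len, tail_len):
    """SHIFT: apply DROP and APPEND together."""
    return append(drop(redundant, head_len), tail_len)


# Toy frame of 8 blocks whose 2 tail-overlap blocks start out redundant.
frame = [False] * 6 + [True] * 2
print(drop(frame, 2).count(False))      # 5 blocks left to decode
print(append(frame, 2).count(False))    # 7 blocks left to decode
print(shift(frame, 2, 2).count(False))  # 6 blocks, moved one block later
```

DROP shortens the decoded part of a frame by one block, APPEND lengthens it by one block, and SHIFT keeps its length while moving it by one block.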

The decoding of frames in a stream embodying the invention into sub-band 
samples is referred to as mapping. Only non-redundant blocks are mapped into sub-band 
samples. If the incoming frame is larger than expected, the operator DROP is applied. 
Conversely, if the incoming frame is smaller than expected, the operator APPEND is applied. 
When the actual size equals the expected size, the decoder looks to the previous frame. If 
the previous frame has been appended or shifted, the operator SHIFT is applied; otherwise, 
the incoming frame is mapped without modification. 
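
The mapping rule above amounts to a small decision procedure. The Python sketch below is one possible reading of it; the function names, the string labels for the operators and the assumption that the frame before the first one counts as NOP are all choices made for this example.

```python
def choose_operator(expected, actual, previous):
    """Select the operator for one incoming frame.

    expected is F(j), actual is G(j), and previous is the operator that
    was applied to the preceding frame.
    """
    if actual > expected:
        return "DROP"
    if actual < expected:
        return "APPEND"
    if previous in ("APPEND", "SHIFT"):
        return "SHIFT"
    return "NOP"


def map_stream(expected_sizes, actual_sizes):
    """Return the operator applied to each frame of a stream."""
    operators = []
    previous = "NOP"   # assumed state before the first frame
    for expected, actual in zip(expected_sizes, actual_sizes):
        previous = choose_operator(expected, actual, previous)
        operators.append(previous)
    return operators


expected = [32, 31, 31, 31]
print(map_stream(expected, [32, 31, 32, 31]))  # ['NOP', 'NOP', 'DROP', 'NOP']
print(map_stream(expected, [31, 31, 31, 31]))  # ['APPEND', 'SHIFT', 'SHIFT', 'SHIFT']
```

The second call shows how a single short frame after an edit leads to SHIFT on the following frames whose size matches the expected size.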


Synchronization error 

A stream embodying the invention is based upon the creation of a mean 
effective audio frame length $\bar{F}$ that equals the video frame length $1/f_V$ by alternation of 
long (i.e., tagged) and short frames in a defined sequence F(j) at encoding. The redundancy 
needed for reproducing the previously defined sequence F(j) of long and short frames at 
decoding, despite the actual length G(j) of the incoming frames after editing, is obtained by 
overlapping frames at editing points. At editing, the synchronisation error $\varepsilon(j)$, in blocks, 
due to swapping frames may be expressed by 



Equation 20 


At any time one may write 

$j \times p = u + N \times v$,    Equation 21 

with $u \in \{0, 1, 2, \ldots, N-1\}$ and $v \in \{0, 1, 2, \ldots, p\}$. By substitution, it follows 

$\varepsilon(j) = \frac{u}{N}$,    Equation 22 

whence $0 \le \varepsilon(j) \le 1 - \frac{1}{N}$. Upon decoding, those redundancies are discarded appropriately 
by using operators NOP, DROP, APPEND and SHIFT as described above. Moreover, the 
incoming frame G(j) may be delayed by one block due to a DROP or SHIFT operation. 
Therefore, it can be shown that the total synchronisation error $\delta$ introduced by the process is 
bounded as follows: 

$\Delta t = 0 \Rightarrow \delta \in \left[0,\, 1 - \frac{1}{N}\right] \;\wedge\; \Delta t = -1 \Rightarrow \delta \in \left[-1,\, -\frac{1}{N}\right]$    Equation 23 

with limits: 

$-1 \le \delta < 1$    Equation 24 
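
The bounds can be checked numerically for given N and p. The sketch below assumes that Equation 22 reads as the remainder u of j x p modulo N, divided by N, and that a one-block delay simply subtracts 1 from that value; both readings, and the function name `sync_errors`, are assumptions of this example.

```python
from fractions import Fraction


def sync_errors(n, p, frames):
    """Return u / N for j = 1..frames, with u = (j * p) mod N."""
    return [Fraction((j * p) % n, n) for j in range(1, frames + 1)]


# Example values N = 4 and p = 1, over two sequences.
errors = sync_errors(4, 1, 8)
delayed = [e - 1 for e in errors]   # assumed effect of a one-block delay

print(min(errors), max(errors))     # 0 3/4
print(min(delayed), max(delayed))   # -1 -1/4
```

Exact fractions are used so that the comparison with 1 - 1/N is not affected by floating-point rounding.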

Cascading 

Several cascading levels of lossy encoding and decoding may degrade the 
signal. However, the use of low compression rates at contribution and distribution, use of 
metadata relating to the compressed signals and special techniques can be employed to keep 
this degradation imperceptible to the end-user. Methods applicable to MPEG encoding are 
known to those working in the technical field (for example, as described in 'Maintaining 
Audio Quality in Cascaded Psychoacoustic Coding', Warner R. Th. ten Kate, 101st AES 
Convention, 1996 November 8-11), which may be used with embodiments of the invention to 
maintain the quality of the audio signal throughout the DTV broadcasting chain. 

EXAMPLES OF THE INVENTION 
Block arrangement 

The audio frame sequence, encoded in accordance with an embodiment of this 
invention, for film and professional audio based on MPEG-2 Layer II and Approach 3 
overlaps is shown in Table 10. All possible arrangements of blocks after decoding the stream, 
according to another embodiment of this invention, are shown in Figure 8. The parameters 
are as follows (referring to the list of symbols, above): 
video frame rate $f_V = 24$ Hz, video frame length $t_V = 41.67$ ms; 
audio sampling frequency $f_A = 48$ kHz, audio frame length $t_A = 24$ ms; 
k = 36 blocks, t = 32 samples; 
M = 2 video frames, N = 4 audio frames; 
overlap: $O_T = 19$ blocks, $\bar{O} = 4.75$ blocks, O = 4 blocks, O + 1 = 5 blocks; 
p = 1 short overlap, q = 3 long overlaps; 
b = 31 blocks, b + 1 = 32 blocks; 
$B_{MIN} = 125$, $B_{MAX} = 129$, $\varphi = 2$ blocks; 
$\varepsilon_{MAX} = 0.75$ block, $\delta \in [0, 0.75] \Leftarrow \Delta t = 0$, $\delta \in [-1, -0.25] \Leftarrow \Delta t = -1$. 


j       1    2    3    4 
H(j)    5    4    5    5 
F(j)    32   31   31   31 

Table 10 

Application of the system to the IEC61937 standard 

A suitable standard for transmitting the stream embodying the invention is the 
IEC61937 standard ('Interface for non-linear PCM encoded audio bitstreams applying IEC 
60958'). In the stream allocation shown in Figure 7 for the previous example: 

• The IEC61937 frame has a length (16 / 32) x 3.072 Mbit/s / $f_V$. For $f_V$ = 24 Hz, it 
corresponds to 64,000 bits. 

• The preambles: Pa = F872h, syncword 1; Pb = 4E1Fh, syncword 2; Pc = burst 
information; Pd = number of bits < 65 536, length code. 

• Repetition period of data-burst is a number of IEC60958 frames. 

• Relative timing accuracy between audio and video after editing a VTR tape and delays 
introduced by switcher systems determine the minimum gap needed between two 
frames. This so-called splicing gap may be obtained by means of null-frame stuffing. 
This can be summarised as: 

• Stuffing = splicing gap + burst spacing; splicing gap = tape + switch inaccuracy; burst 
spacing = 4 x IEC60958 "0" sub-frames, each 4096 x IEC60958 frames. 

• Burst-payload: System frame = (N/M) x [System sub-frame - head overlap]; N = 4; 
M = 2; N/M = 2. 

If the stream embodying the invention is based on MPEG-2 Layer II for 5.1 
channels at 384 kbit/s, the system requires at most 45,504 bits (2 x [(1,152 - 4 x 32) x 384 / 
48 + (2,047 - 4 x 32 / 1,152 x 2,047) x 8] + 0). 

Instead, if the stream embodying the invention is based on a 6-channel 
version of MPEG-1 Layer II at 192 kbit/s per channel for 6 channels, it would require at most 
49,152 bits (2 x (1,152 - 4 x 32) x 6 x 192 / 48 + 0). If we take into account that the LFE 
channel requires only 12 samples per frame, the effective bitrate would be approximately 230 
kbit/s per channel. 
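
The bit budgets quoted in this section can be reproduced with integer arithmetic. In the Python sketch below, the function and constant names are hypothetical, and truncating the term 4 x 32 / 1,152 x 2,047 to a whole number is an assumption of this example; with that assumption the first expression evaluates to exactly 45,504 bits.

```python
SAMPLES_PER_FRAME = 1152        # samples in one Layer II audio frame
OVERLAP_SAMPLES = 4 * 32        # head overlap of 4 blocks of 32 samples


def iec61937_frame_bits(video_rate_hz, link_bitrate=3_072_000):
    """Frame length (16 / 32) x 3.072 Mbit/s / f_V, in bits."""
    return link_bitrate * 16 // 32 // video_rate_hz


def layer2_5_1_payload_bits():
    """Payload for MPEG-2 Layer II, 5.1 channels at 384 kbit/s."""
    first_term = (SAMPLES_PER_FRAME - OVERLAP_SAMPLES) * 384 // 48
    # Integer truncation of the scaled 2,047 term is assumed here.
    second_term = (2047 - OVERLAP_SAMPLES * 2047 // SAMPLES_PER_FRAME) * 8
    return 2 * (first_term + second_term) + 0


def six_channel_payload_bits():
    """Payload for six channels at 192 kbit/s per channel."""
    return 2 * (SAMPLES_PER_FRAME - OVERLAP_SAMPLES) * 6 * 192 // 48 + 0


frame_bits = iec61937_frame_bits(24)
print(frame_bits)                   # 64000
print(layer2_5_1_payload_bits())    # 45504
print(six_channel_payload_bits())   # 49152
```

Both payloads are smaller than the 64,000-bit frame available at 24 Hz, which leaves room for the preambles and stuffing listed above.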
CLAIMS: 


1. An audio encoding scheme for a stream that carries audio and video data, 
which scheme has a mean effective audio frame length $\bar{F}$ that equals the video frame 
length $1/f_V$ over an integral number M video frames, by provision of audio frames variable 
in length F in a defined sequence F(j) at encoding. 

2. An encoding scheme according to claim 1 in which the frame length F is 
adjusted by varying an overlap O between successive audio frames. 

3. An encoding scheme according to claim 1 or claim 2 in which the 
value F(j) repeats periodically on j, the periodicity of F(j) defining a sequence of frames. 

4. An encoding scheme according to claim 3 having M video and N audio 
frames per sequence, each audio frame being composed of k blocks of t samples each. 

5. An encoding scheme according to claim 4 in which a total overlap $O_T$ between 
frames in the sequence is equal to $O_T = p \times O + q \times (O + 1)$, where O is an overlap length in 
blocks, where $p \in \mathbb{N} \wedge q \in \mathbb{N} \wedge O \in \mathbb{N} \wedge O_T \in \mathbb{N}$. 

6. An encoding scheme according to claim 5 in which only audio frames 
corresponding to a particular video frame are overlapped. 

7. An encoding scheme according to claim 6 in which $p = (N - M) \times (O + 1) - O_T$ 
and $q = (N - M) - p$. 


8. An encoding scheme according to claim 5 in which only audio frames 
corresponding to a particular video sequence are overlapped. 

9. An encoding scheme according to claim 8 in which $p = (N - 1) \times (O + 1) - O_T$ 
and $q = (N - 1) - p$. 

10. An encoding scheme according to claim 5 in which any adjacent audio frames 
are overlapped. 

11. An encoding scheme according to claim 10 in which $p = N \times (O + 1) - O_T$ and 
$q = N - p$. 


12. An encoding scheme according to any one of claims 4 to 11 in 
which $\exists\, n \in \mathbb{N}^{+} : n \times t = M \times \frac{f_A}{f_V}$. 


13. An audio encoding scheme for a stream that encodes audio and video data in 
which scheme audio samples of N quasi video-matched frames are encoded in frames with a 
semi-variable overlap whereby the effective length of the audio frames coincides with the 
length of a sequence of M video frames, where M and N are positive integers. 

14. A data stream encoded by a scheme according to any preceding claim. 

15. A data stream according to claim 14 which includes audio frames, each of 
which is tagged to indicate the size of the audio frame. 

16. A data stream according to claim 14 or claim 15 which includes audio frames, 
each block of which is tagged to indicate whether or not the block is a redundant block. 

17. An audio encoder for coding audio for a stream that carries audio and video 
data in which the encoder produces audio frames of variable length such that a mean 
effective audio frame length $\bar{F}$ equals the video frame length $1/f_V$ over an integral number 
M video and N audio frames, by provision of audio frames with variable overlap to have an 
effective length F in a defined sequence F(j) at encoding. 

18. An audio encoder according to claim 17 for coding a stream having a short 
overlap of length O and a total of q long overlaps in a sequence, the encoder calculating the 
head overlap using an algorithm that repeats after N frames. 

19. An audio decoder for decoding a stream that encodes audio and video data, 
which decoder calculates an expected effective frame length of an incoming frame, adjusts 
the actual length of the incoming frame to make it equal to the expected frame length, 
determines whether any block within a received frame is a redundant block or a 
non-redundant block, mapping the non-redundant blocks onto sub-band samples. 

20. An audio decoder according to claim 19 which modifies the overlap status of 
blocks in the data stream by application of one or more of a set of block operators to each 
block. 

21. An audio decoder according to claim 20 in which the set of operators includes 
one or more of: NOP, an operator that does not change the status of a block; DROP, an 
operator that changes the first non-redundant block from the head overlap into a redundant 
block; APPEND, an operator that changes the first redundant block from the tail overlap into 
a non-redundant block; and SHIFT, an operator that is a combination of both DROP and 
APPEND operators. 

ABSTRACT: 

18.01.2002 

An audio encoding scheme for a stream that encodes audio and video data is 
disclosed. The scheme has particular application in mezzanine-level coding in digital 
television broadcasting. The scheme has a mean effective audio frame length $\bar{F}$ that equals 
the video frame length over an integral number M video frames, by provision of audio 
frames variable in length F in a defined sequence where length = F(j) at encoding. The 
length of the audio frames may be varied by altering the length of overlap between adjacent 
frames in accordance with an algorithm that repeats after a sequence of M frames. An 
encoder and a decoder for such a scheme are also disclosed. 

Fig. 4 